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Abstract 

Computing systems have become increasingly complex with the 
emergence of heterogeneous hardware combining multicore CPUs 
and GPUs. These parallel systems exhibit tremendous computa¬ 
tional power at the cost of increased programming effort. This re¬ 
sults in a tension between achieving performance and code porta¬ 
bility. Code is either tuned using device-specific optimizations to 
achieve maximum performance or is written in a high-level lan¬ 
guage to achieve portability at the expense of performance. 

We propose a novel approach that offers high-level program¬ 
ming, code portability and high-performance. It is based on algo¬ 
rithmic pattern composition coupled with a powerful, yet simple, 
set of rewrite rules. This enables systematic transformation and op¬ 
timization of a high-level program into a low-level hardware spe¬ 
cific representation which leads to high performance code. 

We test our design in practice by describing a subset of the 
OpenCL programming model with low-level patterns and by im¬ 
plementing a compiler which generates high performance OpenCL 
code. Our experiments show that we can systematically derive 
high-performance device-specific implementations from simple 
high-level algorithmic expressions. The performance of the gen¬ 
erated OpenCL code is on par with highly tuned implementations 
for multicore CPUs and GPUs written by experts. 

Keywords Algorithmic Patterns, Rewrite Rules, Performance 
Portability, GPU, OpenCL, Code Generation 


1. Introduction 

Computing systems have become extremely complex and diversi¬ 
fied implementing different forms of parallelism and memory hier¬ 
archies. Modem multicore CPUs and GPUs (Graphics Processing 
Units) are often used for general purpose computations. The draw¬ 
back of such systems is the extreme difficulty of programming and 
extracting performance, requiring a deep understanding of the hard¬ 
ware. Software written and tuned for today’s systems needs to be 
adapted frequently to keep pace with ever changing hardware. 

Over the years, a wide range of languages, language extensions 
and frameworks have emerged for programming GPUs and other 
massively parallel devices. The two most common languages are 
CUBA and OpenCL, both directly exposing low-level hardware 
features. Directive based approaches such as OpenACC [33] and 
OpenMP [25], extensions to existing programming languages such 
as Cilk [3] or libraries like Intel TBB [32] have been proposed to 
reduce the complexity of developing code for multicore CPUs and 


GPUs. While these latter approaches simplify the development of 
applications, they all lead to an explosion of specialized implemen¬ 
tations where the same algorithmic concept is tuned differently for 
each device. As a result, performance portability remains elusive; 
code optimized for one device might only achieve a fraction of the 
performance on a different device. 

Several high-level programming models have been proposed to 
address this issue. Petabricks [29] allows the programmer to ex¬ 
press different algorithm implementations and automatically picks 
the best one using auto-tuning. Higher-level dataflow programming 
language such as Streamit [21] or LiquidMetal [15] have been de¬ 
signed with a similar goal in mind. Both languages use dedicated 
backend compiler for different hardware targets such as GPUs. 
Nvidia’s NOVA [9] language takes a more functional programming 
approach where algorithmic patterns such as map or reduce are ex¬ 
pressed as primitives recognized by the backend compiler. While 
definitively a step in the right direction, all these approaches rely on 
ad-hoc techniques such as hard-coded device-specific implemen¬ 
tations or heuristics. When hardware changes occur, the backend 
compiler has to be re-tuned or re-engineered. 

The root of the problem lies in a gap in the system stack be¬ 
tween high-level algorithmic concepts on the one hand and low- 
level hardware paradigms on the other hand. In this work we pro¬ 
pose to bridge this gap by defining a set of rewrite rules which 
systematically translates high-level algorithmic concepts into low- 
level hardware paradigms, both expressed as functional patterns. 
The rewrite rules are used to systematically derive semantically 
equivalent low-level expressions from high-level algorithm expres¬ 
sions written by the programmer. Once derived, we can automati¬ 
cally generate high performance code based on these expressions. 
Our approach is similar in spirit to Spiral [30], but relies on fine 
grain hardware patterns representing CPU and GPU hardware fea¬ 
tures. As a result, in our approach code generation becomes very 
simple since all optimization decisions are handled during the au¬ 
tomatic rewriting process and no complex analysis is performed. 

The power of our approach lies in the rewrite rules, written once 
by an expert system designer. These rules encode the different algo¬ 
rithmic choices and low-level hardware specific optimizations. The 
rewrite rules play the dual role of enabling the composition of al¬ 
gorithmic patterns and enabling the lowering of these patterns onto 
the low-level hardware paradigms. This results in a clear separation 
of concerns between high-level algorithmic patterns and low-level 
hardware paradigms. The rewrite rules define an implementation 
space that can be systematically searched to produce high perfor¬ 
mance code. We believe these principles pave the way to fully au¬ 
tomated portable high performance code generation. 
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Figure 1: Overview of our system. The programmer expresses the 
problem with high-level algorithmic patterns. These are system¬ 
atically transformed into low-level OpenCL patterns using a rule 
rewriting system. OpenCL code is generated by mapping the low- 
level patterns directly to the OpenCL programming model repre¬ 
senting hardware paradigms. 

This paper demonstrates the practicality of our approach using 
OpenCL as our target hardware platform. We compare our ap¬ 
proach with highly-tuned linear algebra functions extracted from 
the state-of-the-art libraries and with larger benchmarks such as 
BlackScholes. We express them as compositions of high-level al¬ 
gorithmic patterns which are systematically lowered to low-level 
OpenCL patterns from which OpenCL code is generated. The per¬ 
formance of our generated code is competitive with highly-tuned 
BLAS linear algebra libraries such as Nvidia GPU CUBLAS, 
AMD GPU clBLAS and Intel MKL on the CPU. 

Our paper makes the following key contributions: 


def mul3{x) = x 
def vectorScal = 


3 // user-defined function 

map{inul3) // map pattern 


(a) High-level expression written by the programmer. 

^ rewrite rules ^ 


def mul3(x) = x 
def vectorScal = 


join o map-workgroupl 
asScalar o map-local( 
vectorize-4(mul3) 

) o asVector-4 
) o split-1024 


(b) Low-level expression systematically derived using rewrite rules. 
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int4 mul3(int4 x) { return x * 3; } 
kernel vectorScal{global int* in,out, int len){ 
for (int i=get_group_id; i < len/1024; 
i-r=get_num_groups) { 
global int* grp in = in+(i*1024); 
global int* grp out = out+(i*1024); 
for (int j=get_local_id; j < 1024/4; 
j-r=get_local_size) { 

global int4* in vec4 =(int4*)grp in+(j*4); 
global int4* out vec4=(int4*)grp out+(j*4); 
*out_vec4 = mul3(*in_vec4); 

} } } 


(c) OpenCL program produced by our code generator. 

Figure 2: Pseudo-code representing vector scaling. The user sim¬ 
ply maps the mul3 function over the elements of the input array (a). 
This high-level expression is systematically transformed into a low- 
level expression (b) using rewrite rules. Finally, our code generator 
turns the low-level expression into an OpenCL program (c). 


• design of high-level algorithmic patterns used by the pro¬ 
grammer and low-level OpenCL patterns representing the 
OpenCL programming model; 

• develop a powerful set of rewrite rules that systematically 
expresses algorithmic and optimization choices; 

• achieve performance portability by systematically applying 
rewrite rules to derive device-specific implementations, leading 
to performance on par with the best hand-tuned versions. 

The paper is structured as follows. Section 2 provides a motivation. 
Sections 3 and 4 present our patterns and rewrite rules. Section 5 
and 6 show our benchmarks and rules in action. Our experimental 
setup and performance results are shown in Sections 7 and 8. 
Finally, Section 9 discusses related work and Section 10 concludes. 

2. Motivation 

The overview of our approach is presented in Figure 1. The pro¬ 
grammer writes a high-level expression composed of algorithmic 
patterns. Using a rewrite rule system, we systematically lower 
this high-level expression into a low-level expression consisting of 
OpenCL patterns. In this rewrite stage algorithmic and optimiza¬ 
tion choices in the high-level expression can be explored. The gen¬ 
erated low-level expression is then fed into our code generator that 
emits an OpenCL program. This program is finally compiled to ma¬ 
chine code by the vendor provided OpenCL compiler. 

We now illustrate the advantages of our approach using a simple 
vector scaling example shown in Figure 2. The user expresses 
the computation by writing a high-level expression using our map 
algorithmic pattern as shown in Figure 2a. This coding style is 
similar to functional and dataflow programming. 


Our technique first rewrites the user provided high-level expres¬ 
sion into something closer to the OpenCL programming model. 
This is achieved by applying the rewrite rules presented later in 
Section 4. Figure 2b shows one possible derivation of the origi¬ 
nal high-level expression where the o operator represents function 
composition, i.e., f o g(x) = f{g{x)). Starting from the last line, 
we first split the input into chunks of 1024 elements. Each chunk is 
mapped onto a group of threads, called workgroup, with the map- 
workgroup low-level pattern (line 2). Within a workgroup (lines 3- 
5), we vectorize the elements (line 5), each mapped to a local thread 
inside a workgroup via the map-local low-level pattern (line 3). 
Each local thread now processes 4 elements, enclosed in a vector 
type. Finally, the vectorize-4 pattern (line 4) implies that the user 
defined function niul3 is vectorized. The exact meaning of our pat¬ 
terns will be given later in Section 3. 

The last step consists of traversing the low-level expression 
and generating OpenCL code for each low-level pattern encoun¬ 
tered (Figure 2c). The two map patterns generate the for loops 
(line 3-4 and 7-8) that iterate over the input array assigning work 
to the workgroups and local threads. The information of how many 
chunks each workgroup and thread processes comes from the corre¬ 
sponding split. In line 11 the vectorized version of the user defined 
mul3 function (defined in line 1) is finally applied to the input array. 

To summarize, our approach is able to generate OpenCL code 
starting from a high-level representation of a program. This is 
achieved by systematically lowering the high-level expression into 
a low-level form suitable for code generation. The next two sections 
present our high-level and low-level patterns, the code generation 
mechanism and the rewrite rules in more details. 


































Pattern 

Type 

Description 

map(f) 

T[n] U[n]J -T^U 

Apply function / to every element of the input array. 

reduce(f z) 

T[]^T[llf ■.{T,T)^T,z-.T 

Apply the reduction function / with initial value z to the input array. 

zip(a,b) 

a : T[n],h-. U[n] -)■ {T,U)[n] 

Zip two arrays into an array of pairs. 

split" 

T[m\[]* r[m/n][n][]* 

Splits the outer most dimension of an array in chunks of size n. 

join 

T[m\[n\[]* T[m*n\[]* 

Joins the two outer most dimensions of an array. 

iterate" (f) 

T[]^T[],f -.TU^TU 

Iterate the function / over the input n times. 

reorder 

T[n] ^ T[n] 

Reorder the element of the input array. 


Table 1: High-level algorithmic patterns used by the programmer. T ^ U means the function input type is T and output type U. 
We write T\n] for an array of type T with size n and [ ]* denotes an arbitrary number of dimensions in an array. 


Pattern 

Type 

Description 

map-workgroup(f) identical to map(f) 

Each workgroup applies function / on a different element of the input array. 

map-local(f) 

identical to map(f) 

Each local thread applies function / on a different element of the input array. 

map-global(f) 

identical to inap(f) 

Each global thread applies function / on a different element of the input array. 

map-seq(f) 

identical to niap(f) 

Apply function / to every element of the input array sequentially. 

reduce-seq(fz) 

identical to reduce(f,z) 

Apply reduction function / with initial value z to the input sequentially. 

reorder-stride” 

identical to reorder 

Access input array with a stride s to maintain memory coalescing. 

toLocal(f) 


U[ ] Change the storage location for the results of function / to local memory. 

toGlobal(f) 

T[]^U[]J -.Tl]^ 

U[ ] Change the storage location for the results of function / to global memory. 

asVector" 

T[ ] [m] Tn[] [m/n] Turns the elements of an array into vector type. 

asScalar 

rn[]*[m] —>■ T[ ]* [m * n] Turns the elements of an array into scalar type. 

vect"(f) 

T[n] ^ U[n], f ■. T U Vectorize the function / by a factor n. 


Table 2: Low-level OpenCL patterns used for code generation. The hardware paradigm used is highlighted in bold in the description. 


3. Patterns Design and Implementation 

One of the key ideas of this paper is to expose algorithmic choices 
and hardware-specific program optimizations as patterns that can 
be systematically derived using a rule rewriting system (discussed 
later in in Section 4). The high-level algorithmic patterns are de¬ 
signed to be used by the programmer directly. The low-level hard¬ 
ware patterns represent hardware specific concepts expressed by a 
low-level programming model such as OpenCL, the target chosen 
for this paper. Following the same approach, a different set of low- 
level hardware patterns could be designed to target other low-level 
programming models such as Pthreads or MPI. 

This section discusses the design of our patterns and how we 
generate OpenCL code for them. We define our patterns as func¬ 
tions which are implicitly applied to exactly one input array and 
produces one output array. To simplify our implementation we de¬ 
cided to encode all types as arrays with primitives represented with 
arrays of length 1. The only exceptions are the user-provided func¬ 
tions such as the mul3 function in Figure 2a that operates on a prim¬ 
itive type. 

3.1 Algorithmic Patterns 

Table 1 presents our high-level algorithmic patterns. These patterns 
are not tied to any specific hardware feature and are used to define 
the program at the algorithmic level by the programmer. 

Map The map pattern is well known in functional programming 
and applies a given function / to all elements of its input array. 

Reduce The reduce pattern (a.k.a. fold or accumulate) uses a 
given binary function / to combine all elements of the input array. 
We require the function / to be associative and commutative which 
allows for an efficient parallel implementation. 

Zip and Split/Join These patterns transform the shape of the data 
and we store this information, i.e., number of dimensions and size 
of each dimension, in the type system. The zip pattern fuses two 


arrays into an array of pairs. The split pattern, which is most of¬ 
ten combined with SLjoin, partitions an array into chunks of spe¬ 
cific size resulting in an extra dimension. The corresponding join 
pattern does the opposite; it reassembles arrays of arrays by merg¬ 
ing dimensions. These two patterns used together are similar to the 
split-join concept from data flow languages such as Streamit [39]. 

Iterate The iterate pattern corresponds to the mathematical defi¬ 
nition of iteratively applying a function. It is defined as: = id 

and = f" ° f- In terms of implementation, our code gener¬ 
ator emits a for-loop to perform the iteration, and two pointers for 
input and output. After each iteration, we swap the pointers, so that 
the output of the last iteration becomes the input for the next one. 

Reorder The reorder pattern is used to specify that the ordering 
of the elements of an array does not matter. This allows our system 
to reorder arbitrarily the elements of an array and might enable 
optimizations, as we will see later. 

3.2 OpenCL-speciflc Patterns 

It is well known, that programming parallel hardware such as 
manycore CPUs and GPUs is quite complex. In order to achieve 
the highest performance, programmers often use a set of rules of 
thumb to drive the optimization of their application for the spe¬ 
cific devices. In fact each hardware vendor provides optimization 
guides [1, 27] that extensively cover hardware particularities and 
how to optimize code for them. 

In this paper we focus on the OpenCL programming model, 
which is a popular low-level programming model used to program 
manycore CPUs and GPUs. Programming these devices consists 
of writing a compute kernel in OpenCL C that executes on the 
device and writing the host code that orchestrates data movement, 
allocates memory and manages the execution on the device. 

We encode the OpenCL programming model by formaliz¬ 
ing hardware paradigms expressed as patterns. Table 2 gives an 
overview of the OpenCL-specific patterns we have identified. 








Parallel Maps The different map patterns represent possible 
ways of mapping computations to the hardware and exploit par¬ 
allelism in OpenCL. The map-workgroup(f) pattern assigns work 
to a group of threads, called workgroup in OpenCL, by letting ev¬ 
ery workgroup apply the function / on a different element of the 
input array. Similarly, the map-local(f) pattern assigns work to a 
local thread inside a workgroup. As workgroups are optional in 
OpenCL the map-global(f) pattern assigns work to a global thread, 
i.e., a thread not organized in a workgroup. This allows us to map 
computations in different ways to the thread hierarchy of OpenCL. 

The code generation for all these map patterns is similar, we de¬ 
scribe it using map-workgroup(f) as an example. A loop is gener¬ 
ated, where the iteration variable is determined by the workgroup- 
id, which is provided by OpenCL. Inside of the loop, a pointer is 
generated to partition the input array, so that every workgroup pro¬ 
cesses a different chunk of data. We use this pointer as the input for 
the function / being bound to the map. Similarly, we generate and 
use a pointer for the output. After emitting the code for the loop, 
we continue with the body of the loop by generating the code for 
the function /. When the generation of the body is finished, an ap¬ 
propriate synchronization mechanism for the given map pattern is 
added. For instance after a map-local we add a barrier synchroniza¬ 
tion to synchronize the threads inside of the workgroup. 

Sequential Map and Reduce The map-seq(f) and reduce-seq(f, z) 
patterns perform a sequential map and reduction, respectively, 
within a single thread. In both cases the generated code consists 
of a simple for loop iterating over the array and calling the function 
/. In case of the reduction an accumulation variable is initialized 
with 2 . The variable is then passed to the function / in each itera¬ 
tion and the computed result is stored in the accumulation variable 
and finally written to the output of the reduction. 

Reorder-stride Using this pattern the elements of an array are re¬ 
ordered with a stride s. In effect, this generates an access to an 
array such that out[t\ = in[i/n-\- s ■ {i mod n)\, where n • s is the 
size of the array. This hardware pattern ensures that after splitting 
the workload, consecutive threads access consecutive memory ele¬ 
ments (known as a coalesce memory access) which is beneficial on 
modem GPUs as it maximizes the memory bandwidth. 

Our implementation of this pattern does not produce code di¬ 
rectly, but generates instead an index function, which is used when 
accessing the array the next time. While not discussed here, our 
design allows us to support user-defined index functions as well. 

Local/Global The toLocal(f) and toGlobal(f) patterns are used to 
determine where the result of function / should be stored. OpenCL 
defines two distinct address spaces: global and local. Global mem¬ 
ory is the commonly used large but slow memory. On GPUs, the 
small local memory has a high bandwidth with low latency and 
is used to store frequently accessed data. With these two patterns, 
we can in effect exploit the memory hierarchy defined in OpenCL. 
These patterns act similarly to a typecast and are in fact imple¬ 
mented as such so that no code is emitted directly. 

In our design, every function reads its input and writes its output 
using pointers provided by the callee function. As a result, we can 
simply force a store to local memory by wrapping any function with 
our toLocal pattern. In the code generator, this will simply change 
the output pointer of function / to an area in local memory. 

Vectorize and asVector/asScalar The OpenCL programming 
model supports vectorization with special data types such as int4 
where any operations on this type will be executed in the hardware 
vector units. In the absence of vector units in the hardware, the 
OpenCL compiler scalarizes the code automatically. 

The asVector and asScalar patterns change the data type into 
vector elements and scalar elements respectively. For instance, in 


OpenCL an array of int is transformed into an array of int4 as 
seen in the motivation example (Figure 2). The vect^(f} pattern vec¬ 
torizes a function by simply converting all the operations in / that 
apply to vector types into vectorized operations. Our current im¬ 
plementation can only vectorize functions containing simple arith¬ 
metic operations such as -|- or —. In case of more complex func¬ 
tions, we rely on external tools [24] for vectorizing the operations. 

4. Rewrite Rules 

This section introduces our set of rewrite rules that transform high- 
level expressions written using our algorithmic patterns into seman¬ 
tically equivalent expressions. One goal of our approach is to keep 
each rule as simple as possible and only express one fundamental 
concept at a time. For instance the vectorization rule, as we will 
see, is the only place where we express the vectorization concept. 
This is different from most prior approaches that would produce a 
special vectorized version of different algorithmic patterns such as 
map or reduce. The superiority of our approach lies in the power 
of composition; many rules can be applied successively to produce 
expressions that compose hardware concepts or optimizations and 
that are provably correct by construction. 

Similarly to our patterns, we distinguish between algorithmic 
and lowering rules. Algorithmic rules produces derivations that rep¬ 
resent the different algorithmic choices and are shown in Figure 3. 
Figure 4 shows our OpenCL-specific rules which map expressions 
to OpenCL patterns. Once the expression is in its lowest form, it 
is possible to produce OpenCL code for each single pattern easily 
with our code generator as described in the previous section. 

4.1 Algorithmic Rules 

Iterate decomposition The rule in Figure 3a expresses the fact an 
iteration can be decomposed into several iterations. 

Reorder commutativity Figure 3b shows a rule stating that if the 
data can be reordered arbitrarily it does not matter if we apply a 
function / to each element before or after the reordering. 

Split-join The split-join rule in Figure 3c partitions a map into 
two maps. This allows us to nest map patterns in each other and, 
thus, map the computation to the thread hierarchy of the OpenCL 
programming model such as map-workgroup(map-local(f)) as seen 
in our motivation example (Figure 2). 

Reduction The reduction (and associated partial reduction) in 
Figure 3d is currently our most complex rule but also the most 
powerful one. It expresses the reduction function as a composition 
of other primitive functions, which is a fundamental aspect of our 
work. From the algorithmic point of view we first define a partial 
reduction pattern part-red. This partial reduction reduces an array 
of n elements to an array of m elements where 1 < m < n. The 
reduction can be derived in a partial reduction combined with a full 
reduction which ensures we end up with one unique element. 

Partial Reduction The first possible derivation for partial reduc¬ 
tion, in Figure 3d, leads to the full reduction which means m = 1. 
The next possible derivation expresses the fact that it is possible 
to reorder the elements to be reduced, expressing the commutativ¬ 
ity property of our definition of reduction. The third derivation is 
actually the only place where parallelism is expressed in the defi¬ 
nition of our reduction pattern. This rule expressed the fact that it 
is valid to partition the input elements first and then reduce them 
independently. Finally, the last possible derivation expresses the 
notion that it is possible to perform a partial reduction with an it¬ 
erative process by repetitively applying the same partial reduction 
function. This concept is very important when considering how the 
reduction function is typically implemented on a GPU (iteratively 
reducing within a workgroup using the local memory). 



iterate"'^"(f) —>■ iterate"'(f) o iterate"(f) 


(a) Iterate decomposition 



(f) Fusion rules 


Figure 3: Algorithmic rules. Bold patterns are known to the code 
generator. 

Simplification Rules Figure 3e shows our simplification rules. 
They express the fact that consecutive split-join pairs and asVector- 
asScalar pairs are equivalent to the identity function id. 

Fusion Rules Finally, our fusion rules are shown in Figure 3f. 
The first rule fuses the functions applied hy two consecutive maps. 
The second rule fuses the map-reduce pattern by creating a lambda 
function that is the results of merging function / and g from the 
original reduction and map respectively. This rule only applies to 
the sequential version since this is the only implementation not 
requiring the associativity property required by the more generic 
reduce pattern. When generating code, these rules in effect allow 
us to fuse the implementation of the different functions and avoid 
having to store temporary results. More generic rules for fusion 
have been studies by the functional programming community [10, 
23]. However, as we currently focus on a restricted set of patterns 
our simpler fusion rules have, so far, proven to be sufficient. 

4.2 OpenCL-Specific Rules 

Figure 4 shows our OpenCL-specific rules that are used to apply 
OpenCL optimizations and to lower high-level concepts down to 
OpenCL-specific ones. Patterns that are known to the code genera¬ 
tor are shown in bold in both Figure 3 and 4. 

Maps The rule in Figure 4a is used to produce OpenCL-specific 
map implementations that match the thread hierarchy of the 
OpenCL programming model. Our implementation maintains con¬ 
text information to ensure the thread hierarchy is respected. For in¬ 
stance, it is only legal to nest a map-local inside a map-workgroup. 

Reduction There is only one lowering rule for reduction (Fig¬ 
ure 4b), which expresses the fact that the only OpenCL implemen¬ 
tation known to the code generator is a sequential reduction. Possi¬ 
ble parallel implementations of the reduction pattern are defined at 
a higher level by composition of other algorithmic patterns. To the 


map(f) map-workgroup(f) \ map-local(f) 

I map-global(f) \ map-seq(f) 


(a) Map 



(e) Vectorization 


Figure 4: OpenCL-specific rules. Bold patterns are known to the 
code generator. 


best of our knowledge, all other existing high performance compil¬ 
ers treat the reduction directly as an irreducible primitive operation. 
The power of our approach is that the code generator implementa¬ 
tion only needs to know about the simple sequential reduction. As 
a result, it is possible to explore different implementation for the 
reduction by simply applying different rules. 

Reorder Figure 4c presents the rule that reorders elements of 
an array. In our current implementation, we support two types of 
reordering: no reordering, represented by the id identify function, 
and reorder-stride which reorders elements with a certain stride s. 
As described earlier, the major use case for the stride reorder is to 
enable coalesced memory accesses. 

Local/Global Figure 4d shows two rules that enable GPU local 
memory usage. They express the fact that the result of a map-local 
can always be stored in local memory or back in global memory. 
This holds since a map-local always exists within a map-workgroup 
for which the local memory is defined. These rules allow us to 
determine how the data is mapped to the GPU memory hierarchy. 

Vectorization Finally, Figure 4e shows the vectorization rule. 
Vectorization is achieved by using the asVector and correspond¬ 
ing asScalar which changes the element type of an array and adjust 
the length accordingly. This mle is only allowed to be applied once 
to a given map(jj pattern. This constrain can easily be checked by 
looking at the function’s type; if it is a vector type, the rule cannot 
be applied. Another set of rules, not shown here for space reason, 
are used to propagate the vect”" function recursively within /. 

4.3 Summary 

The power of our approach lies in the composition of our rules that 
produce complex low-level expressions from simple high-level ex¬ 
pressions. Looking back at our motivation example in Figure 2, we 
see how a simple algorithmic pattern such as map can effectively be 
derived into a low-level expression by applying the rules. This ex¬ 
pression matches various hardware concepts expressible with the 
OpenCL programming model such as mapping computation and 
data to the GPU thread and memory hierarchy and vectorization. 
Each single rule encodes a simple, easy to understand, provable 
fact. By composition of the rules we systematically derive low-level 
expressions which are semantically equivalent to the high-level ex¬ 
pressions by construction. This results in a powerful mechanism to 
safely explore the space of possible implementations. 
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def 

add{x, 

y) 

= X + y 



2 

def 

mult(x, 

y) 

= x * y 



3 

def 

abs(x) 


= if (x < 0) 

x else X 

5 

def 

scal{a, 

x) 

= map{mult(a) 

x) 


6 

def 

asumlx) 


= reduce(add, 

0) o 

map(abs, x) 

7 

def 

dot (a:, 

y) 

= reduce(add, 

0) o 

map(mult) o zip{x, y) 

8 

def 

gemv (A 

X 

y, a, b) = 



9 


z = map( 

scal(a) o dot(x), 

A) 


10 

mapladd) 

o 

zip( z, scal(b 

y) 

) 


Figure 5: Linear algebra kernels from the BLAS library expressed 
using our high-level algorithmic patterns. 

5. Benchmarks 

We now discuss how applications from linear algebra, mathemat¬ 
ical finance and physics can be represented as expressions com¬ 
posed of our high-level algorithmic patterns. We use the following 
conventions to simplify the syntax: non-capitalized letters (e.g.,x) 
denote scalar variables, letters with an arrow on top (e.g., x) denote 
ID vectors, and capitalized letters (e.g.. A) denote 2D matrices. 

5.1 Linear Algebra Kernels 

We choose linear algebra kernels as our first set of benchmarks, be¬ 
cause they are well known, easy to understand, and used as building 
blocks in many other applications. Figure 5 shows how we express 
vector scaling (line 5), sum of absolute values (line 6), dot product 
of two vectors (line 7) and matrix vector multiplication (line 8- 
10) using our high-level patterns. While the first three benchmarks 
perform computations on vectors, matrix vector multiplication was 
chosen to illustrate a computation using a 2D data structures. 

For scaling (line 5), the map pattern applies a function to each 
element which multiplies it with a constant. This function is ex¬ 
pressed by partially applying the mult function, i.e., binding a to 
the first argument of mult. The sum of absolute values (line 6) and 
the dot product (line 7) applications both produce scalar results by 
performing a summation, which we express using the reduce pat¬ 
tern combined with the addition. For dot product, a pair-wise mul¬ 
tiplication of its two input vectors is performed before applying the 
reduction. This is expressed using the zip and map patterns. 

Line 8-10 shows the implementation of matrix vector multipli¬ 
cation as defined by the BLAS library: y = aAx -\- Py. To mul¬ 
tiply matrix A with vector x, the map pattern maps the computa¬ 
tion of the dot-product with the input vector x to each row of the 
matrix A (line 9). Notice how we are reusing the high-level expres¬ 
sions for dot-product and scaling as building blocks for the more 
complex matrix-vector multiplication. This shows the power of our 
system: expressions describing algorithmic concepts can be reused, 
without committing to a particular low-level implementation; The 
dot-product from gemv (line 9) might be implemented in a totally 
different way from the stand-alone dot-product kernel (line 7). 

5.2 Mathematical Finance Application 

The BlackScholes application uses a Monte-Carlo method for op¬ 
tion pricing and computes for each stock price s a pair of call and 
put options {c, p}. Figure 6 shows the BlackScholes implemen¬ 
tation, where the function defined in line 1 computes the call and 
put option for a single stock price s. Two intermediate results dl 
and d2 are computed and used to compute the call and put options 
which are returned as a single pair. The compDl, compD2, compCall 
and compPut functions are not shown here since they only contain 
purely sequential code implementing the BlackScholes model. This 
BSComputation function is applied to all stock prices, stored in a 
vector s, using the map pattern in line 4. 


1 def BSComputation(s) = 

2 dl = compDl(s); d2 = compD2(dl,s) 

3 return { compCall(dl,d2,s), compPut(dl,d2,s) } 

4 def blackScholes(s) = map{BSComputation, s) 


Figure 6: BlackScholes mathematical finance application ex¬ 
pressed using our high-level algorithmic patterns. 


1 def updateF(f, nid, p, p, t) = 

2 n = p[nld]; d = calculateDistrance(p, n) 

3 if (d < t) f += calculateForce{d) 

4 return f 

5 def md(p, N, t) = map( 

6 A p, n: reduce(A f, nId: updateF(f,nId,p,p,t), 0, n) 

7 ) o zip(p, N) 


Figure 7: Molecular dynamics physics application expressed using 
our high-level algorithmic patterns. 


5.3 Physics Application 

Another application we consider is the the molecular dynamics 
(MD) application from the SFIOC [12] benchmark suite. It calcu¬ 
lates the sum of all forces acting on a particle from its neighbors. 
Figure 7 shows the implementation using our high-level patterns. 

The function updateF is defined in line 1 and updates the force 
f of particle p by computing and adding the local force between a 
single particle and one of its neighbors. updateF takes an index of a 
neighbor nid, the vector storing all particles p, and a threshold t as 
additional parameters. Using nid and p the neighboring particle is 
accessed in line 2 and the distance between the neighboring particle 
and the particle p ist computed. If the distance is below the given 
threshold t the local force between the two particles is calculated 
based on the distance and added to the overall force f (line 3) which 
is finally returned in line 4. Otherwise the particle is ignored in the 
summation. 

For computing the force for all particles p, we use the zip pat¬ 
tern (line 7) to build a vector of pairs. Each pair combines a sin¬ 
gle particle with the indices of all of its neighboring particles. The 
function which is applied to each pair by the map pattern (line 5) is 
expressed as an lambda expression (line 6). Computing the result¬ 
ing force exerted by all the neighbors on one particle is done by ap¬ 
plying the reduce pattern on vector n which stores the indices of the 
neighboring particles. We use the previously defined function up¬ 
dateF inside the reduction to compute the force each particle with 
index nid add to the overall force on p. At this point we fix all but 
the first two arguments as the other arguments remain constant for 
particle p. The usage of lambda expressions in our system allows 
for easy binding of additional information as arguments to func¬ 
tions. This application example should give some evidence that our 
patterns are flexible enough to implement real world applications. 

6. Deriving Specialized Implementations 

This section shows how our rules can be applied to derive different 
implementations starting from the same high-level expression. We 
illustrate this process using the asum benchmark from the previous 
section as a simple example. The computation can easily be ex¬ 
pressed using two of our high-level algorithmic patterns, as shown 
in Figure 8 (1). The abs function is applied to every element of the 
input vector x and then the intermediate result is summed up using 
the reduce pattern which is customized with the addition operator. 











asum(x) = reduce{+,0) o map{abs,x) (1) 

= reduce{+, 0) o join o map{part-red{+, 0)) o split" o map{abs, x) (2) 

= reduce[-\-, 0) o join o map(part-red{+, 0)) o split" o join o map{map{abs)) o split"(x) (3) 

= reduce{+, 0) o join o map(part-red{+, 0)) o map{map{abs)) o split"(x) (4) 

= reduce(+,0) o join o map{part-red{+,0) o map{abs)) o split"(x) (5) 

= reduce{+, 0) o join o map(part-red{+, 0) o map-seq{abs)) o split"{x) (6) 

= reduce[-\-, 0) o join o map{reduce-seq{+, 0) o map-seq{abs)) o split" {x) (7) 

= reduce{+, 0) o join o map{reduce-seq{\ acc, a : acc + abs{a), 0) o split"{x) (8) 


Figure 8: Derivation for asum(x) to a fused version. The numbers above the equality sign refer to the rules from Figure 3 and Figure 4. 


def asum{a?) = reduce-seq o join o map-workgroup! 

join 0 toGlobai! map-local( map-seq(id) ) ) o split-1 
0 iterate-?! join o map-local! reduce-seq!plus, 0) ) o split-2 ) 

0 join 0 toLocal! map-local! reduce-seq!absAndPlus, 0) ) ) o split-2048 o reorder-stride 
) 0 split-262144!i) 


def asum!i?) = reduce-seq o join o asScalar o map-workgroup! 

join 0 toGlobal! map-local! map-seq!vectorize-4!id) ) ) ) o split-1 
0 iterate-8! join o map-local! reduce-seq!vectorize-4|plus), vectorize-4!0)) ) o split-2 ) 

0 join 0 toLocal! map-local! reduce-seq|vectorize-4!absAndPlus), vectorize-4!0)))) o split-2 o reorder-stride 
) 0 as\/ector-4 o split-2048!x) 


def asum!x) = reduce-seq o join o asScalar o map-workgroup! 

join 0 map-local! reduce-seq!vectorize-4|absAnd+), vectorize-4!0)) ) o split-8192 
) 0 as\/ector-4 o split-32768!x) 


Figure 9: Low-level expressions performing the sum of absolute values specialized for Nvidia (a), AMD (b), and Intel (c). These expressions 
are systematically derived by our system from the high-level expression reduce{+, 0) o map{abs, x). 


6.1 Deriving a Fused Implementation 

To achieve good performance it is in general beneficial to avoid 
storing intermediate results. Rule 3f allows us to apply this princi¬ 
ple and fuse two patterns into one, thus, avoiding an intermediate 
result. Figure 8 shows how we can systematically derive a fused 
version of the asum application from the high-level expression writ¬ 
ten by the programmer. We write the derivation as a sequence of 
equations using a slightly more mathematical notation, where the 
numbers above the equality sign refer to the rules applied. 

To obtain expression (2) we apply the reduction rule 3d twice: 
first to replace reduce with reduce o part-red and then a second 
time to expand par-red. Afterwards, we expand map to get (3), 
which can be simplified by removing the two corresponding join 
and split patterns. In the step from (4) to (5) two map patterns are 
fused and in the next step the nested map is lowered into the map- 
seq pattern to obtain (6). By first transforming part-red back into 
reduce (using rule 3d) and then applying the lowering rule 4b we 
get (7). Finally, we apply rule 3f to fuse the map-seq and reduce-seq 
into a single reduce-seq. This sequence of transformations results 
in expression (8) which allows for a more optimal implementation 
since no temporary storage is required for the intermediate result. 

6.2 Deriving Device Specific Implementations 

The previous section showed how an optimization can be systemat¬ 
ically applied which is generally beneficial on every hardware plat¬ 
form. However, there exist many optimizations which are highly 
specific to a particular hardware architecture. For instance it is of¬ 
ten beneficial to apply vectorization on an Intel CPU but not on an 
Nvidia GPU, on the contrary using local memory is usually benefi¬ 
cial on GPUs but not on CPUs. Figure 9 shows three different im¬ 
plementations of the asum benchmark which have been derived us¬ 
ing the same systematic approach of applying rules as seen in Fig¬ 


ure 8. These implementations have been in fact inspired by hand- 
tuned OpenCL and CUDA kernels from the different vendors. This 
demonstrates the expressive power of our OpenCL patterns. Each 
implementation is optimized to take advantage of the features of a 
particular hardware architecture. The integer parameters deciding 
how to split the data and the width of vectorization where chosen 
by exploring different values empirically. 

The first implementations, shown in Figure 9a, is optimized 
for an Nvidia GPU. The input vector x is split in large chunks 
which are processed in parallel by different workgroups. Inside 
each workgroup the reorder-stride pattern ensures fast coalesced 
memory accesses when loading the data from global memory. Each 
local thread reduces 2048 elements and stores the intermediate 
result in local memory. Afterwards, the entire workgroup performs 
an iterative computation to reduce the intermediate results down 
to a single result before this is copied back to global memory. 
Eigure 9b shows the AMD optimized implementation which is 
similar to the previous one. The same set of optimizations have 
been applied to take advantage of the local memory and to ensure 
coalesced memory accesses. In addition, since the AMD GPU has 
vector units, the reduction has been vectorized by a width of four. 

The third implementation as seen in Eigure 9c, is targeted at 
an Intel CPU and is very different. It neither uses local memory 
nor the reorder-stride pattern. Vectorization is applied, similar 
to the AMD version, but in contrast, the implementation uses dif¬ 
ferent numbers for partitioning the data for workgroup and local 
threads; only a single thread is active inside each workgroup This 
corresponds to the fact, that there is less parallelism available on 
a CPU compared to GPUs. These three different implementations 
derived from the same high-level expression should give some evi¬ 
dence of the power of our approach which is able to systematically 
derive highly hardware-specific implementations. 










6.3 Towards Automatic Derivation 


We have shown how our system can systematically transform and 
optimize programs at an algorithmic level and at a hardware level 
without performing complex compiler analysis. While manually 
deriving the expression is a tedious process, our vision is for this 
to take place in a fully automated way following the principles in¬ 
troduced in our work. Our rules and OpenCL-specific patterns pave 
the way to a fully automatic search strategy starting from the high- 
level expression. There has been other work in this area ranging 
from random search to sophisticated search strategies based on per¬ 
formance models or machine learning techniques [13, 29]. We see 
this work as completely orthogonal to this paper, since our first 
focus is to develop the systematic foundations necessary to apply 
such techniques. Nonetheless, we have implemented a prototype 
automatic search technique that is actually able to find expressions 
with similar performance to those presented in Figure 9. This sug¬ 
gest it is possible to automatically derive highly tuned implemen¬ 
tations from high-level expressions but this is left for future work. 

7. Experimental Setup 

This section describes some implementation details of our code 
generator and our experimental setup used for the experiments. 

7.1 Implementation Details 

Our system is implemented in C-^+ll, using the template system 
and support for lambda functions. When generating code for a 
derived expression two basic steps are performed. First, we use the 
Clang/LLVM compiler library to parse the input expression and 
produce an abstract syntax tree for it. Second, we traverse the tree 
and emit code for every function call representing one of our low- 
level hardware patterns. 

As part of the first step, we have developed our own type sys¬ 
tem which plays a dual role. First, it prevents the user to produce 
incorrect expressions. Secondly, the type system encodes informa¬ 
tion necessary for code generation, such as memory address space 
and array size information, which are used to allocate memory. 

The design of our code generator is straightforward since no 
optimization decisions are made at this stage. We avoid performing 
complex analysis of the code which makes our design very different 
compared to traditional optimizing compilers. 

7.2 Hardware Platforms and Evaluation Methodology 

We used three hardware platforms to perform the runtime experi¬ 
ments: an Nvidia GeForce GTX 480 GPU, an AMD Radeon HD 
7970 GPU and a dual socket Intel Xeon E5530 server, with 8 cores 
in total and hyper-threading enabled. We used the latest OpenCL 
runtime from Nvidia (CUDA-SDK 5.5), AMD (AMD-APP 2.8.1) 
and Intel (XE 2013 R3 3.2.1.16712). The GPU drivers installed on 
our Linux system were 310.44 for Nvidia and 13.1 for AMD. 

We use the profiling APIs from OpenCL and CUDA to measure 
kernel execution time and the gettimeofday function for the CPU 
implementation. Poliowing the Nvidia benchmarking methodol¬ 
ogy [20], the data transfer time to and from the GPU is excluded. 
We repeat each experiment 1000 times and report median runtimes. 

Por our linear algebra benchmarks, we have performed experi¬ 
ments with two input sizes. For seal, asum and dot, the small in¬ 
put size corresponds to a vector size of 16M elements (64MB). 
The large input size uses 128M elements (512MB, the maximum 
OpenCL buffer size for our platforms). For gemv, we use an input 
matrix of 4096x4096 elements (64MB) and a vector size of 4096 
elements (16KB) for the small input size. For the large input size, 
the matrix size is 8192x 16384 elements (512MB) and the vector 
size 8192 elements (32KB). For BlackScholes, the problem size is 
fixed to 4 million elements and for MD it is 12288 particles. 
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Figure 10: Performance of our approach relative to a portable 
OpenCL reference implementation. We outperform the clBLAS 
implementation on most benchmarks and platforms. 

8. Results 

We now evaluate our approach compared to reference OpenCL 
implementations of our benchmarks on all platforms. Furthermore, 
we compare the BLAS routines against platform-specific tuned 
implementations. 

8.1 Comparison vs. Portable Implementation 

We want to show how our approach performs across three plat¬ 
forms. We use the BLAS OpenCL implementations written by 
AMD as our baseline for this evaluation since it is inherently 
portable across our different platforms. Figure 10 shows the per¬ 
formance of our approach relative to clBLAS for the BLAS rou¬ 
tines. As can be seen, we achieve better performance than clBLAS 
on most platforms and benchmarks. The speedups are the highest 
on the CPU, with up to 20 x for the asum benchmark with a small 
input size. The reason is that clBLAS was written and tuned specif¬ 
ically for an AMD GPU which usually exhibit a larger number of 
parallel processing units. As we saw in Section 6.2, our systemat¬ 
ically derived expression for this benchmark is specifically tuned 
for the CPU by avoiding creating too much parallelism, which is 
what gives us such large speedup. 

Figure 10 also shows the results we obtain relative to the Nvidia 
SDK BlackScholes and SHOC molecular dynamics MD bench¬ 
mark. For BlackScholes, we see that our approach is on par with the 
performance of the Nvidia implementation on both GPUs. On the 
CPU, we actually achieve a 2.2 x speedup due to the fact that the 
Nvidia implementation is tuned for GPUs while our implementa¬ 
tion generates different code for the CPU. For MD, we are actually 
on par with the OpenCL implementation on all platforms. 

8.2 Comparison vs. Highly-tuned Implementations 

We now compare our approach with a highly-tuned implementation 
for each platform. For Nvidia, we pick the highly tuned CUBLAS 
CUDA-specific implementation of BLAS written by Nvidia. For 
the AMD GPU, we use the same clBLAS implementation as before 
given that it has been written and tuned specifically for AMD 
GPUs. Finally, for the CPU we use the Math Kernel Library (MKL) 
implementation of BLAS written by Intel which is known for its 
high performance. 

Figure 11a shows that we actually match the performance of 
CUBLAS for seal, asum and dot on the Nvidia GPU. For gemv 
we outperform CUBLAS on the small size by 20% while we are 
within 5% for the large input size. Given that CUBLAS is a pro¬ 
prietary library highly tuned for Nvidia GPUs, these results should 
offer some confidence that our technique is able to achieve high 
performance. 
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Figure 11: Performance comparison of our approach relative to a highly-tuned platform-specific library; CUBLAS for Nvidia, clBLAS for 
AMD and MKL for the CPU. Our approach matches the performance of CUBLAS and MKL, and outperforms clBLAS on some routines. 


On the AMD GPU, we are surprisingly up to 4.5 x faster than 
the clBLAS implementation on gemv small input size as shown in 
Figure 1 lb. The reason for this is found in the way clBLAS is im¬ 
plemented. For the gemv benchmark, clBLAS performs automatic 
code generation using fixed templates. In contrast to our approach 
they only generate one implementation since they do not explore 
different pattern compositions. 

For the Intel CPU (Figure 11c), we see that our approach beats 
MKL for one benchmarks and match the performance of MKL 
on most of the other three benchmarks. For the small input sizes 
for the seal and dot benchmarks we are within 13% and 30% 
respectively. For the larger input sizes we are on par with MKL for 
both benchmarks. The asum implementation in the MKL does not 
use thread level parallelism, where our implementation does and, 
thus, achieves a speedup of up to 1.78 on the larger input size. 

This section has shown how our approach can generate perfor¬ 
mance portable code that is competitive with highly-tuned platform 
specific implementations. 

9. Related Work 

Algorithmic Patterns Algorithmic patterns or skeletons [8] have 
been around for more than two decades. Pattern-based libraries for 
platforms ranging from cluster systems [34] to GPUs [37] have 
been proposed with recent extension to irregular algorithms [17]. 
This includes popular framework such as Map-Reduce [14] from 
Google. Many researchers have looked at the problem of optimiz¬ 
ing map-reduce operations for different type of hardware. Para- 
prox [35] for instance uses automatic detection of algorithmic pat¬ 
terns to apply optimization at the expense of accuracy. Compared 
to our approach, most prior works rely on hardware-specific imple¬ 
mentations to achieve high performance. Conversely, we systemat¬ 
ically generate implementations using fine-grain OpenCL patterns 
combined with our rule rewriting system. 

Functional Approaches for GPU Code Generation Accelerate 
is a functional domain specific language built within Haskell to 
support GPU acceleration [7, 26]. Recently, Nvidia has presented 
NOVA [9], a new functional language target at code generation 
for GPUs, and Copperhead [5], a data parallel language embed¬ 
ded in Python. NOVA shares many concepts from Accelerate and 
Copperhead and offers familiar data parallel patterns. HiDP [43] 
is a hierarchical data parallel language which maps computations 
to the OpenCL programming model similar to our approach. All 
these projects rely on analysis of user code or hand-tuned ver¬ 
sions of high-level algorithmic patterns. In contrast, our approach 
uses rewrite rules and low-level hardware patterns to produce high- 
performance code in a portable way. 


Halide [31] is a domain specific approach targeting image pro¬ 
cessing pipelines. It separates a programs functional algorithmic 
description from optimization decisions and applies autotuning to 
find the best optimization on different hardware platforms. Our 
work is domain agnostic and takes a different approach to achieve 
high performance. We systematically describe hardware paradigms 
as functional patterns instead of encoding specific optimizations 
which might not apply to future hardware generations. 

Rewrite-rules for Optimizations Rewrite rules have been used 
very early as a way to automate the optimization process of func¬ 
tional programs [23]. Recently, rewriting has been applied to HPC 
applications [28] as well, where the user annotates imperative code 
providing information necessary for the rewrite process. Similar to 
us. Spiral [30] also uses rewrite rules to optimize signal processing 
programs and was more recently adapted to linear algebra [36] and 
other mathematical domains [ 1 6] . In contrast our rules and OpenCL 
hardware patterns are expressed at a much finer level, allowing for 
highly specialized and optimized code generation. 

High-level Code Generation for GPUs A large body of work 
has explored how to automatically generate high performance 
code for GPUs. Dataflow programming models such as IBM’s 
LiquidMetal [15] or Streamit [39] have been used to automati¬ 
cally produce GPU code with OpenCL or CUDA [21, 22, 40]. 
Directive based approach have also been used such as OpenMP to 
CUDA [25], OpenACC to OpenCL [33], or hiCUDA [19] which 
translates sequential C code to CUDA. Optimized implementations 
for directive based reductions on GPUs has been presented [42] as 
well. XIO [38], a language for high performance computing, can 
also be used to program GPUs [11]. However, the programming 
style remains low-level since the programmer has to express the 
same low-level operations found in CUDA or OpenCL. Recently, 
researchers have looked at generating efficient GPU code for loops 
using the polyhedral framework [18,41]. Delite [4,6], a system that 
enables the creation of domain-specific languages, can also target 
multicore CPUs or GPUs. Unfortunately, all these approaches do 
not provide full performance portability since the mapping of the 
application assumes a fixed platform and the optimizations and 
implementations are targeted at a specific device. 

Finally, Petabricks [2] takes a different approach by letting 
the programmer specify different algorithms implementations. The 
compiler and runtime then choose the most suitable one based 
on an adaptive mechanism and can produce OpenCL code [29]. 
Compared to our work, they generate optimized code by relying 
on static analysis. Our code generator does not make any decisions 
nor perform any analysis since the optimization process happens at 
a higher level within our rewrite rules. 

















































10. Conclusion 

In this paper, we have presented a novel approach based on 
rewrite rules to represent algorithmic principles as well as low- 
level hardware-specific optimization. We have shown how these 
rules can be systematically applied to transform a high-level ex¬ 
pression into a device-specific implementation. This results in a 
clear separation of concern between high-level algorithmic con¬ 
cepts and low-level hardware optimizations which pave the way 
for fully automated high performance code generation. 

To demonstrate the power of our approach in practice, we 
have developed OpenCL-specific rules and patterns together with 
an OpenCL code generator. The design of the code generator is 
straight-forward given that all optimizations decisions are made 
with the rules and no complicated analysis passes are needed. We 
achieve performance on par with highly tuned platform-specific 
BLAS libraries on three different devices; AMD GPU, Nvidia GPU 
and Intel CPU. For benchmarks such as matrix vector multiplica¬ 
tion we even reach a speedup of up to 4.5. We also show that our 
technique achieves portable performance for more complex ap¬ 
plications such as the BlackScholes benchmark or for molecular 
dynamics simulation. 
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